Classifying Large Data Sets Using SVM with Hierarchical Clusters
Abstract
Support vector machines (SVMs) have been a promising method for classification and regression analysis because their solid mathematical foundation conveys several salient properties that other methods do not provide. Despite these properties, SVMs are less favored for large-scale data mining than for pattern recognition or machine learning, because their training complexity depends heavily on the size of the data set. Many real-world data mining applications involve millions or billions of data records, where even multiple scans of the entire data set are too expensive to perform. This paper presents a new method, Clustering-Based SVM (CB-SVM), which is specifically designed for handling very large data sets. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide an SVM with high-quality samples that carry statistical summaries of the data, such that the summaries maximize the benefit of learning the SVM. CB-SVM tries to generate the best SVM boundary for very large data sets given a limited amount of resources. Our experiments on synthetic and real data sets show that CB-SVM is highly scalable for very large data sets while also achieving high classification accuracy.
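The general idea in the abstract (summarize the data in one scan, then train the SVM on the summaries) can be illustrated with a minimal sketch. This is not the CB-SVM algorithm itself: the clustering here is flat rather than hierarchical, the SVM is a simple linear one trained by subgradient descent, and clustering each class separately is an assumption made for this illustration. All function names (`micro_cluster`, `train_linear_svm`, `predict`, `demo`) and parameter values are hypothetical.

```python
import random

def micro_cluster(points, radius):
    """Single pass over points; each cluster stores only (count, linear sum)."""
    clusters = []  # each entry: [count, linear_sum_vector]
    for x in points:
        best, best_d2 = None, None
        for c in clusters:
            n, ls = c
            # squared distance from x to the cluster centroid ls / n
            d2 = sum((xi - si / n) ** 2 for xi, si in zip(x, ls))
            if best_d2 is None or d2 < best_d2:
                best, best_d2 = c, d2
        if best is not None and best_d2 <= radius ** 2:
            best[0] += 1                                   # absorb the point
            best[1] = [si + xi for si, xi in zip(best[1], x)]
        else:
            clusters.append([1, list(x)])                  # start a new cluster
    return [([s / n for s in ls], n) for n, ls in clusters]

def train_linear_svm(samples, lam=0.01, lr=0.1, epochs=300):
    """Full-batch subgradient descent on a weighted hinge loss.

    samples: list of (x, y, weight) with y in {-1, +1}.
    """
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    total = sum(wt for _, _, wt in samples)
    for _ in range(epochs):
        gw, gb = [lam * wi for wi in w], 0.0               # L2 regularizer gradient
        for x, y, wt in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                                 # hinge loss is active
                for i in range(dim):
                    gw[i] -= wt / total * y * x[i]
                gb -= wt / total * y
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def demo(seed=0, n_per_class=1000):
    """Two synthetic Gaussian blobs; returns (number of summaries, accuracy)."""
    rng = random.Random(seed)
    data = {
        +1: [(rng.gauss(2.0, 0.3), rng.gauss(2.0, 0.3)) for _ in range(n_per_class)],
        -1: [(rng.gauss(-2.0, 0.3), rng.gauss(-2.0, 0.3)) for _ in range(n_per_class)],
    }
    samples = []
    for y, pts in data.items():
        # Assumption: summarize each class separately so labels never mix.
        for centroid, count in micro_cluster(pts, radius=0.5):
            samples.append((centroid, y, count))
    w, b = train_linear_svm(samples)
    correct = sum(predict(w, b, x) == y for y, pts in data.items() for x in pts)
    return len(samples), correct / (2 * n_per_class)
```

Calling `demo()` trains on the weighted centroids only and then scores the classifier against all of the raw points, which shows how the SVM can work from far fewer samples than the data set contains. Weighting each centroid by its point count is one simple way to let a summary stand in for the points it absorbed.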
Related articles
High performance of the support vector machine in classifying hyperspectral data using a limited dataset
To prospect for mineral deposits at a regional scale, recognizing and classifying hydrothermal alteration zones using remote sensing data is a popular strategy. Due to the large number of spectral bands, classification of hyperspectral data may be negatively affected by the Hughes phenomenon. A practical way to handle the Hughes problem is preparing a lot of training samples until the size ...
Detecting RNA Sequences Using Two-Stage SVM Classifier
RNA sequence detection is time-consuming because of the huge size of the data sets involved. Although SVM has proved useful, a normal SVM is not suitable for classifying large data sets because of its high training complexity. A two-stage SVM classification approach is introduced for fast classification of large data sets. Experimental results on several RNA sequences detection demonstrate that the pr...
Support vector machine classification for large data sets via minimum enclosing ball clustering
Support vector machine (SVM) is a powerful technique for data classification. Despite its good theoretical foundations and high classification accuracy, a normal SVM is not suitable for classifying large data sets, because the training complexity of SVM is highly dependent on the size of the data set. This paper presents a novel SVM classification approach for large data sets by using minimum ...
Reducing the Size of Very Large Training Set for Support Vector Machine Classification
Normal support vector machine (SVM) algorithms are not suitable for classification of large data sets because of their high training complexity. In this paper, we introduce a method based on an edge recognition technique to find low-value data; to preserve the input data distribution, we use a clustering algorithm such as k-means to compute cluster centers. Data is selected through edge recognition algorithm ...
A natural framework for sparse hierarchical clustering
There has been a surge in the number of large and flat data sets – data sets containing a large number of features and a relatively small number of observations – due to the growing ability to collect and store information in medical research and other fields. Hierarchical clustering is a widely used clustering tool. In hierarchical clustering, large and flat data sets may allow for a better co...
Publication date: 2003